Goal: Assess the quality of Illumina paired-end reads associated with a bacterial pathogen, before and after trimming
Input: Raw Illumina paired-end reads in gzipped fastq format (*.fastq.gz)
Output: Trimmed Illumina paired-end reads in gzipped fastq format (*.fastq.gz)
Programs Used:computer hardware: physical elements of a computer
computer software: instructions which tell computer hardware how to perform a task
application software: a division of computer software; software which uses the computer system to perform functions beyond the basic operation of the computer itself
system software: a division of computer software; software used to manage computer hardware behavior and provide a platform for other software to run
operating system (OS): a division of system software; system software that manages computer hardware and software resources and provides common services for computer programs. Examples include Microsoft Windows and macOS.
kernel: component of an operating system; software at the core of a computer’s operating system which is used to mediate access to the computer’s resources
central processing unit (CPU): electronic circuitry within a computer which is responsible for running or executing programs; CPUs carry out the instructions of a computer program by performing the basic arithmetic, logic, controlling, and input/output (I/O) operations specified by the instructions
CPU core: the processing unit of a CPU
thread: thread of execution; an ordered sequence of instructions that tells the computer what to do
random-access memory (RAM): a form of computer data storage which stores data and machine code that are currently being used
Unix shell: also known as a shell; a command-line interpreter that provides a command line user interface for Unix-like operating systems
terminal emulator: also known as a terminal; a text-based interface for typing commands, allowing users to interact with a Unix shell
graphical user interface (GUI): a form of user interface that allows users to interact with electronic devices through graphical icons and visual indicators
Unix is a family of operating systems that descend from the original Unix, which was developed in 1969 by Ken Thompson, Dennis Ritchie, Douglas McIlroy, and Joe Ossanna at AT&T’s Bell Laboratories in the United States (Ritchie, 1984, AT&T Bell Laboratories Technical Journal). In 1984, Bell Labs began distributing Unix as a proprietary product which users were not allowed to modify. Unix operating systems with which you may already be familiar are Apple’s macOS and Darwin (a component of what is now known as Apple’s macOS).
In 1983, Richard Stallman began the GNU Project with the goal of creating a completely free Unix-compatible software system (in Stallman’s own words: “Once GNU is written, everyone will be able to obtain good system software free, just like air.”). By 1990, nearly all of the major components of a GNU operating system were finished; however, one important piece was still missing: the kernel, which is where Linus Torvalds comes in. In 1991, while attending the University of Helsinki, Torvalds created a kernel which would later be released as the Linux kernel in 1992. By combining Linux with the GNU system, a fully functioning operating system was born. GNU/Linux, or Linux, refers to this family of free, open-source, Unix-like operating systems based on the Linux kernel. Popular Linux distributions include Ubuntu, Debian, and Red Hat.
For this workshop, we will be doing nearly all of our computing in a Linux environment because
Bash (Bourne-Again shell) refers to a Unix shell (a command-line interpreter that provides a command line user interface for Unix-like operating systems) and command language. It was written by Brian Fox for the GNU Project, with a beta version released in 1989. Since then, it has become the most popular Linux shell, as it is the default interactive shell for users on most Linux and macOS operating systems (which is why we will be using it for this workshop!). Bash allows users to type commands in a specific language, which the computer then executes.
Before we start, it’s important to understand the structure of the Linux file system so we can navigate it from the command line.
If you’re a Windows user, you may be familiar with a file system that looks something like this:
Mac users might understand their file systems as something that looks like this:
The Linux file system isn’t so different; some popular Linux distributions, such as Ubuntu, might even have a very nice GUI, like the one below:
However, when working from the command line, our view won’t be so pretty. But even without a GUI, we can imagine the underlying structure of the Linux file system as something resembling an upside-down tree, like the image below:
Files, such as text files, image files, fastq files, fasta files, etc. are inside of directories. Directories are just another name for what one might call a “folder” in the Windows or Mac worlds. In the tree above (Figure 8), directories have green boxes around them, while files don’t.
If you look at the top of the tree (Figure 8), you’ll see a directory called the root directory, denoted by a forward slash (/). You can think of the root directory as the trunk of a tree from which all branches (paths) start.
One important thing to note before we begin is the difference between
absolute paths and relative paths, which can be a
frustrating concept for Linux command-line beginners. Paths are used to
specify a particular file or directory; the full path, or
absolute path to a file or directory starts from the root, /, and
follows the branches of the files system, with each directory separated
by a slash (/), until the file or directory of interest is reached. For
example, in the tree above (Figure 8), if our file of interest is
fluffy.jpg, the absolute path to fluffy.jpg is
/home/sue/Pictures/pets/fluffy.jpg (just follow the shaded
green boxes!). A relative path, on the other hand, specifies the path to
a directory or file relative to another; usually, this is the directory
you are currently in, also known as the working directory. So,
for example, in the picture above (Figure 8), if we are currently inside
the home directory, the relative path to
fluffy.jpg is sue/Pictures/pets/fluffy.jpg
(see how we didn’t start with a forward slash? That’s because we didn’t
start at the root directory). If we move to the Pictures
directory, our relative path to fluffy.jpg is
pets/fluffy.jpg. Finally, if we move to the
pets directory, the relative path to
fluffy.jpg is just fluffy.jpg! However, the
absolute path to fluffy.jpg remains
/home/sue/Pictures/pets/fluffy.jpg.
Also, note that paths, file names, and directories are
case-sensitive. For example, in the image above, if you search
for fluffy.jpg at the following path
/home/sue/pictures/pets/fluffy.jpg, Bash will throw an
error saying that your file doesn’t exist. Why? Because the actual path
is /home/sue/Pictures/pets/fluffy.jpg, where the
Pictures directory has a capital “P”. If you don’t type the
path correctly, Bash will look for a directory named
pictures within /home/sue/, which doesn’t
exist.
Finally, note that you should try to avoid naming files and directories with special chacters, particularly whitespace, slashes, asterisks, and other special characters. You’ll notice that bioinformaticians love underscores and dashes, so consider using those instead of hitting the space key when you’re naming a file or directory!
activity1
directory:cd activity1
If we type pwd, we will see that we are now inside of
activity1:
pwd
If we type ls, we will see that our current directory
(activity1) is empty:
ls
The isolate’s genome was sequenced using paired-end reads: forward
reads are stored in a file called
mystery_salmonella_A_1.fastq.gz, and reverse reads are
stored in a file called mystery_salmonella_A_2.fastq.gz.
Both of these files are in fastq format (.fastq),
compressed using gzip (.gz; a common method for compressing
genomic data).
Let’s download the forward reads from the remote server to our local
computer using the wget command:
wget http://beagle.henlab.org/public/course_introtobioinfo/mystery_salmonella_A_1.fastq.gz
If we type ls, we should see our forward reads in our
current directory:
ls
Now, let’s download the reverse reads:
wget http://beagle.henlab.org/public/course_introtobioinfo/mystery_salmonella_A_2.fastq.gz
If we type ls, we’ll see that we now have forward and
reverse reads in our current directory:
ls
To install FastQC, let’s create a conda environment named
fastqc by running the following:
conda create -n fastqc
Once that finishes, let’s activate our environment:
conda activate fastqc
Now, let’s install FastQC using conda, as described on
the conda
website:
conda install bioconda::fastqc
Once that finishes, we can check if FastQC is installed properly by running
fastqc --version
This will print the version of FastQC that is installed.
We can also run the following to print the FastQC help message (explains how to run FastQC, the different options and what they do, etc.):
fastqc -h
The general format for running FastQC from the command line is as follows (note: don’t execute this command; it is just an example):
fastqc </path/to/first_reads.fastq.gz> </path/to/second_reads.fastq.gz> </path/to/third_reads.fastq.gz> </path/to/nth_reads.fastq.gz>
Let’s try running FastQC with our forward and reverse reads. Because they are already in our current directory, we can just supply the names of the files:
fastqc mystery_salmonella_A_1.fastq.gz mystery_salmonella_A_2.fastq.gz
Once FastQC is finished, we can type ls and see that we
now have four additional files: a .zip and a
.html file for our two .fastq.gz files with
our forward and reverse Illumina raw reads:
ls
fastqc conda environment:conda deactivate
Here, we will use Trimmomatic to:
A. Trim any Illumina adapters that are left over from the Nextera library preparation
B. Use a sliding window approach to measure average quality within a 4-base window, trimming when the average per-base quality drops below a Phred quality score of 15
C. Remove trailing bases (i.e., bases at the end of a read) which drop below a Phred quality score of 3
D. Remove leading bases (i.e., bases at the start of a read) which drop below a Phred quality score of 3
E. Remove reads less than 36 bp in length (note that employing a minimum read length filter like we are doing here can be especially important for analyses which require counting of mapped reads, such as RNA-Seq)
Let’s get trimming! Back in your terminal, make sure you are in your
activity1 directory with the forward and reverse reads from
our Mystery Salmonella A isolate
(mystery_salmonella_A_1.fastq.gz and
mystery_salmonella_A_2.fastq.gz):
pwd
should show that we are in our activity1 directory
ls
should show mystery_salmonella_A_1.fastq.gz and
mystery_salmonella_A_2.fastq.gz
(If not: cd ~/activity1)
conda. First, let’s
create a conda environment named trimmomatic:conda create -n trimmomatic
Once our environment has been created, let’s activate it:
conda activate trimmomatic
Now, let’s install trimmomatic as described on the conda website:
conda install bioconda::trimmomatic
We can check that Trimmomatic has been installed properly by running the following:
trimmomatic -version
This command will print the version of Trimmomatic that has been installed.
The command to run Trimmomatic for paired-end Illumina reads takes the following form (note: don’t execute this command; it is just an example):
trimmomatic PE -threads 1 -phred33 -trimlog log.txt -summary trimmomatic_summary.out input_forward.fq.gz input_reverse.fq.gz output_forward_paired.fq.gz output_forward_unpaired.fq.gz output_reverse_paired.fq.gz output_reverse_unpaired.fq.gz ILLUMINACLIP:adapters-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
For this command, we are:
trimmomatic)PE)-threads 1)-phred33)-trimlog log.txt)-summary trimmomatic_summary.out)input_forward.fq.gz)input_reverse.fq.gz)output_forward_paired.fq.gz)output_forward_unpaired.fq.gz)output_reverse_paired.fq.gz)output_reverse_unpaired.fq.gz)ILLUMINACLIP:adapters-PE.fa:2:30:10)LEADING:3)TRAILING:3)SLIDINGWINDOW:4:15)MINLEN:36)Notice above that we need to tell Trimmomatic (i) which adapter sequences we want to trim, and (ii) where those adapter sequences are located on our computer. In our case, we know that the Illumina library preparation was carried out using a Nextera XT kit. Luckily, Trimmomatic has Nextera adapter sequences! To make it easier for you, you can simply download the Nextera adapter sequences from our remote server:
wget http://beagle.henlab.org/public/course_introtobioinfo/NexteraPE-PE.fa
If we type ls, we should see a file named
NexteraPE-PE.fa in our current directory:
ls
Can you guess which type of file this is?
Alternatively, if you want a challenge: when you installed
Trimmomatic using conda, the same exact Nextera adapter
sequence file was downloaded to your local computer! Can you find where
conda stored the adapter file? Try running the Trimmomatic
command below using this file!
trimmomatic PE -threads 1 -phred33 -trimlog log.txt -summary trimmomatic_summary.out mystery_salmonella_A_1.fastq.gz mystery_salmonella_A_2.fastq.gz mystery_salmonella_A_trimmedP_1.fastq.gz mystery_salmonella_A_trimmedS_1.fastq.gz mystery_salmonella_A_trimmedP_2.fastq.gz mystery_salmonella_A_trimmedS_2.fastq.gz ILLUMINACLIP:NexteraPE-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
What are we doing when we run this?
trimmomatic)PE)-threads 1)-phred33)-trimlog log.txt)-summary trimmomatic_summary.out)mystery_salmonella_A_1.fastq.gz)mystery_salmonella_A_2.fastq.gz)mystery_salmonella_A_trimmedP_1.fastq.gz; the
_1.fastq.gz suffix lets us know that the file contains forward reads,
while trimmedP lets us know that these reads are trimmed read
pairs)mystery_salmonella_A_trimmedS_1.fastq.gz because they are
trimmed forward SINGLE [unpaired] reads)mystery_salmonella_A_trimmedP_2.fastq.gz)mystery_salmonella_A_trimmedS_2.fastq.gz)ILLUMINACLIP:NexteraPE-PE.fa:2:30:10)LEADING:3)TRAILING:3)SLIDINGWINDOW:4:15)MINLEN:36)ls with the -l
and -h options to list our files and their respective sizes
in human-readable format; we can additionally supply
*.fastq.gz after the command, which allows us to only list
files ending in .fastq.gz:ls -lh *.fastq.gz
trimmomatic conda
environment:conda deactivate
fastqc conda environment:conda activate fastqc
fastqc mystery_salmonella_A_trimmedP_1.fastq.gz mystery_salmonella_A_trimmedP_2.fastq.gz
Once FastQC finishes, open the html files for your trimmed
forward and reverse reads in your browser (i.e.,
mystery_salmonella_A_trimmedP_1_fastqc.html and
mystery_salmonella_A_trimmedP_2_fastqc.html, respectively).
Has the quality of your reads improved?
Deactivate your conda environment:
conda deactivate
Babraham Bioinformatics. FastQC: a quality control tool for high throughput sequence data. Accessed April 14, 2019. URL: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Bolger, A.M., M. Lohse, and B. Usadel. 2014. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30(15): 2114-2120. DOI: 10.1093/bioinformatics/btu170.
JGI. BBDuk: Decontamination Using Kmers. Accessed April 15, 2019. URL: https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/bbduk-guide/
Krueger, F. and Babraham Bioinformatics. Trim Galore: a wrapper tool around Cutadapt and FastQC to consistently apply quality and adapter trimming to FastQ files, with some extra functionality for MspI-digested RRBS-type (Reduced Representation Bisufite-Seq) libraries. Accessed April 15, 2019. URL: http://www.bioinformatics.babraham.ac.uk/projects/trim_galore/.
Martin, M. 2011. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal 17(1): 10-12. DOI: https://doi.org/10.14806/ej.17.1.200.
Ritchie, D.M. 1984. The UNIX system: The evolution of the UNIX time-sharing system. AT&T Bell Laboratories Technical Journal 63(8): 1577-1593. DOI: 10.1002/j.1538-7305.1984.tb00054.x.
Schubert, M., S. Lindgreen, and L. Orlando. 2016. AdapterRemoval v2: rapid adapter trimming, identification, and read merging. BMC Research Notes 9: 88. DOI: 10.1186/s13104-016-1900-2.
Print the contents of a file:
cat <path/filename>
Move to a different directory:
cd <path>
Move up one directory:
cd ..
Move to the root directory:
cd /
Move to your home directory:
cd ~
Copy the source file to the destination:
cp <source> <destination>
Copy the source directory to the destination:
cp -r <source> <destination>
Download a file from an online source (wget):
wget filepath
Show disk usage:
df
Show directory space usage:
du
Find all instances of file in path with given name:
find <path> -name <pattern>
Show memory and swap usage in human-readable format:
free -h
Print all appearances of pattern in file:
grep <pattern> <path/filename>
Count all appearances of pattern in file:
grep -c <pattern> <path/filename>
Gzip file(s):
gzip path/filename
Decompress gzipped file(s):
gunzip path/filename.gz
Print the first 10 lines of a file:
head <path/filename>
Print system hostname:
hostname
List the given path:
ls <path>
List the given path, showing permissions, size, and modification dates in human readable format:
ls -lh <path>
List the given path, showing permissions, size, and modification dates in human readable format, sorting by filesize from largest to smallest:
ls -lhS <path>
List the given path, showing permissions, size, and modification dates in human readable format, sorting by filesize from smallest to largest:
ls -lhSr <path>
View the manual for a command (press q to exit
manual):
man <command>
Create a directory:
mkdir <path/directory_name>
Move the source file or directory to the destination:
mv <source> <destination>
Print working directory (where am I?):
pwd
Remove a file:
rm <path/filename>
Remove an empty directory:
rm -d <path/directory>
Remove a non-empty directory (use with EXTREME caution!):
rm -r <path/directory>
Secure copy a file from remote server to the dir
directory on your machine:
scp user@host:file dir
Secure copy a file from your machine to the dir
directory on a remote server:
scp file user@host:dir
Secure copy the directory foo from remote server to the
directory bar on your machine:
scp -r user@host:foo bar
Execute a shell script, script.sh:
sh script.sh
Connect to host as user:
ssh user@host
Print the last 10 lines of a file:
tail <path/filename>
Create a tar file, file.tar:
tar -cf file.tar path/directory
Extract a tar file, file.tar:
tar -xf file.tar
Create a gzipped tar file, file.tar.gz:
tar -czf file.tar.gz path/directory
Extract a gzipped tar file, file.tar.gz:
tar -xzf file.tar.gz
Print who you are logged in as:
whoami